Geospatial Crop Targeting System for Cambodia¶
Data Science Workflow: Soil-Based Agricultural Zoning¶
Project: National Soil Information and Land Suitability Evaluation System
Objective: Develop an evidence-based system for agricultural zone classification using geospatial data and machine learning.
Workflow Overview¶
| Phase | Cell | Method |
|---|---|---|
| 1. Problem Definition | Intro | Business objectives & success metrics |
| 2. Data Acquisition | 1-2 | Load shapefile & production data |
| 3. EDA | 3-4 | Statistical summaries & distributions |
| 4. Data Cleaning | 5 | Handle missing values & CRS alignment |
| 5. Feature Engineering | 6-7 | Rule-based classification & preprocessing |
| 6. Visualization | 8-9 | Distributions, correlations, heatmaps |
| 7. ML Implementation | 10-11 | K-Means clustering & model evaluation |
| 8. Model Interpretation | 12 | Cluster characterization & insights |
| 9. Advanced Visualization | 13-14 | Interactive mapping & PCA analysis |
| 10. Results Export | 15 | Multi-format output generation |
Environment Setup & Configuration¶
import pandas as pd
import numpy as np
from pathlib import Path
import geopandas as gpd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import warnings
warnings.filterwarnings("ignore")
# visualization settings
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 10
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
pd.set_option("display.max_rows", 100)
2: Data Acquisition & Initial Loading¶
# Resolve data paths relative to repository root
notebook_dir = Path(".").resolve()
base_path = notebook_dir.parent if notebook_dir.name == "src" else notebook_dir
shapefile_path = base_path / "soil-type" / "ST.shp"
crop_prod_path = base_path / "dataset" / "crop_prod.csv"
# Load geospatial data
gdf_soil = gpd.read_file(shapefile_path, engine="pyogrio")
gdf_soil.rename(columns={"Name": "Soil_Type"}, inplace=True)
# Load crop production reference
df_crops_ref = pd.read_csv(crop_prod_path)
print(f"Geospatial data: {gdf_soil.shape[0]} polygons, CRS: {gdf_soil.crs}")
print(f"Crop reference data: {df_crops_ref.shape}")
Geospatial data: 17 polygons, CRS: EPSG:32648 Crop reference data: (66, 2)
3: Exploratory Data Analysis - Geospatial Data¶
# Display geospatial data overview
display(gdf_soil.describe())
display(gdf_soil.info())
# Soil type distribution
soil_counts = gdf_soil["Soil_Type"].value_counts()
fig, ax = plt.subplots(figsize=(12, 6))
soil_counts.plot(kind="barh", ax=ax, color="steelblue")
ax.set_title("Soil Type Distribution Across Cambodia", fontsize=12, fontweight="bold")
ax.set_xlabel("Number of Polygons")
plt.tight_layout()
plt.show()
# Check for missing values
missing_summary = gdf_soil.isnull().sum()
if missing_summary.any():
display(missing_summary[missing_summary > 0])
else:
print("No missing values in geospatial dataset")
| Soil_Type | geometry | |
|---|---|---|
| count | 17 | 17 |
| unique | 17 | 17 |
| top | Acid Lithosols | MULTIPOLYGON (((309750.55179999955 1180936.461... |
| freq | 1 | 1 |
<class 'geopandas.geodataframe.GeoDataFrame'> RangeIndex: 17 entries, 0 to 16 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Soil_Type 17 non-null object 1 geometry 17 non-null geometry dtypes: geometry(1), object(1) memory usage: 404.0+ bytes
None
No missing values in geospatial dataset
CELL 4: Data Cleaning - CRS Handling & Validation¶
# Validate and reproject CRS if necessary
if gdf_soil.crs != "EPSG:4326":
print(f"Reprojecting from {gdf_soil.crs} to EPSG:4326 for web mapping...")
gdf_soil = gdf_soil.to_crs("EPSG:4326")
print(f"Reprojection complete. New CRS: {gdf_soil.crs}")
# Validate geometry
invalid_geom = gdf_soil[~gdf_soil.geometry.is_valid]
if len(invalid_geom) > 0:
print(f"Found {len(invalid_geom)} invalid geometries. Fixing...")
gdf_soil.geometry = gdf_soil.geometry.buffer(0)
print("Geometry validation complete")
else:
print("All geometries valid")
# Display final geospatial dataset
display(gdf_soil.head(1))
print(
f"\nCleaned dataset: {gdf_soil.shape[0]} polygons, {len(gdf_soil.columns)} columns"
)
Reprojecting from EPSG:32648 to EPSG:4326 for web mapping... Reprojection complete. New CRS: EPSG:4326 Found 1 invalid geometries. Fixing... Geometry validation complete
| Soil_Type | geometry | |
|---|---|---|
| 0 | Acid Lithosols | MULTIPOLYGON (((107.62943 13.40398, 107.62603 ... |
Cleaned dataset: 17 polygons, 2 columns
5: Province-Level Data Engineering¶
# Create comprehensive province dataset with synthetic crop production data
provinces_cambodia = [
"Banteay Meanchey",
"Battambang",
"Kampong Cham",
"Kampong Chhnang",
"Kampong Speu",
"Kampong Thom",
"Kandal",
"Kep",
"Koh Kong",
"Kratie",
"Mondolkiri",
"Oddar Meanchey",
"Pailin",
"Phnom Penh",
"Preah Sihanouk",
"Preah Vihear",
"Prey Veng",
"Pursat",
"Ratanakiri",
"Siem Reap",
"Stung Treng",
"Svay Rieng",
"Takev",
"Tbong Khmum",
]
np.random.seed(42)
df_provinces = pd.DataFrame(
{
"Province_Name": provinces_cambodia,
"Rice_Yield_Ton_Ha": np.random.uniform(2.5, 4.5, len(provinces_cambodia)),
"Cassava_Yield_Ton_Ha": np.random.uniform(10, 15, len(provinces_cambodia)),
"Mango_Density": np.random.uniform(20, 80, len(provinces_cambodia)),
"Corn_Yield_Ton_Ha": np.random.uniform(3.0, 6.0, len(provinces_cambodia)),
"Soybean_Yield_Ton_Ha": np.random.uniform(1.5, 3.5, len(provinces_cambodia)),
"Rubber_Yield_Ton_Ha": np.random.uniform(2.0, 4.0, len(provinces_cambodia)),
"Sugarcane_Yield_Ton_Ha": np.random.uniform(50, 90, len(provinces_cambodia)),
"Pepper_Yield_Kg_Ha": np.random.uniform(1000, 2500, len(provinces_cambodia)),
"PalmOil_Yield_Ton_Ha": np.random.uniform(10, 20, len(provinces_cambodia)),
}
)
# Display statistics
display(df_provinces.describe())
| Rice_Yield_Ton_Ha | Cassava_Yield_Ton_Ha | Mango_Density | Corn_Yield_Ton_Ha | Soybean_Yield_Ton_Ha | Rubber_Yield_Ton_Ha | Sugarcane_Yield_Ton_Ha | Pepper_Yield_Kg_Ha | PalmOil_Yield_Ton_Ha | |
|---|---|---|---|---|---|---|---|---|---|
| count | 24.000000 | 24.000000 | 24.000000 | 24.000000 | 24.000000 | 24.000000 | 24.000000 | 24.000000 | 24.000000 |
| mean | 3.380325 | 12.291819 | 50.325207 | 4.529972 | 2.443227 | 2.992980 | 67.959784 | 1709.038454 | 14.962874 |
| std | 0.582669 | 1.494284 | 19.698128 | 0.850749 | 0.609111 | 0.600045 | 11.213008 | 454.981492 | 3.078611 |
| min | 2.541169 | 10.171943 | 22.713637 | 3.016566 | 1.550838 | 2.013904 | 51.475478 | 1024.881743 | 10.050616 |
| 25% | 2.866019 | 10.961932 | 31.881949 | 3.964899 | 1.936482 | 2.577209 | 59.563232 | 1320.195158 | 12.086092 |
| 50% | 3.240902 | 12.378117 | 47.941201 | 4.776863 | 2.436112 | 2.844519 | 67.144718 | 1781.307727 | 16.197653 |
| 75% | 3.771816 | 13.339750 | 68.529927 | 5.212203 | 3.035163 | 3.620168 | 75.727247 | 2021.361559 | 17.156233 |
| max | 4.439820 | 14.828160 | 79.213216 | 5.661638 | 3.359395 | 3.943564 | 89.426018 | 2405.094983 | 19.004181 |
6: Data Integration & Mapping¶
# Create soil-to-province mapping (comprehensive - handles multiple soil type names)
soil_to_province_mapping = {
"Acid Lithosols": "Mondolkiri",
"Alluvial Lithosols": "Kratie",
"Alumisols": "Ratanakiri",
"Basic Lithosols": "Stung Treng",
"Brown Alluvial Soils": "Kampong Cham",
"Brown hydromorphics": "Kampong Cham",
"Calcaric Cambisols": "Pursat",
"Cambisols": "Battambang",
"Chromic Cambisols": "Banteay Meanchey",
"Coastal Complex": "Preah Sihanouk",
"Cultural hydromorphics": "Kandal",
"Eutric Fluvisols": "Kampong Thom",
"Ferric Acrisols": "Siem Reap",
"Great Lake": "Kampong Thom",
"Grey hydromorphics": "Prey Veng",
"Greyic Luvisols": "Oddar Meanchey",
"Mollic Planosols": "Preah Vihear",
"Pellic Vertisols": "Svay Rieng",
"Rhodic Ferralsols": "Kandal",
"Sandsols": "Koh Kong",
"Tropical Black Earths": "Prey Veng",
"Water": "Preah Sihanouk",
}
# Apply mapping with fallback to closest province
gdf_soil["Province_Name"] = gdf_soil["Soil_Type"].map(soil_to_province_mapping)
# For unmapped soil types, assign default province (ensures no NaN in Province_Name)
unmapped_mask = gdf_soil["Province_Name"].isna()
if unmapped_mask.any():
print(f"Unmapped soil types found: {gdf_soil[unmapped_mask]['Soil_Type'].unique()}")
gdf_soil.loc[unmapped_mask, "Province_Name"] = "Kandal" # Default province
gdf_merged = gdf_soil.merge(df_provinces, on="Province_Name", how="left")
# Handle missing values
numeric_cols = [
"Rice_Yield_Ton_Ha",
"Cassava_Yield_Ton_Ha",
"Mango_Density",
"Corn_Yield_Ton_Ha",
"Soybean_Yield_Ton_Ha",
"Rubber_Yield_Ton_Ha",
"Sugarcane_Yield_Ton_Ha",
"Pepper_Yield_Kg_Ha",
"PalmOil_Yield_Ton_Ha",
]
gdf_merged[numeric_cols] = gdf_merged[numeric_cols].fillna(0)
print(f"Merged dataset: {gdf_merged.shape[0]} records with {len(numeric_cols)} crops")
display(
gdf_merged[
[
"Soil_Type",
"Province_Name",
"Rice_Yield_Ton_Ha",
"Cassava_Yield_Ton_Ha",
"Mango_Density",
"Corn_Yield_Ton_Ha",
"Soybean_Yield_Ton_Ha",
"Rubber_Yield_Ton_Ha",
]
].head(10)
)
Unmapped soil types found: ['Lacustrine Alluvial Soils' 'Latosols' 'Planosols' 'Plinthite podzols' 'Plinthitic hydromorphics' 'Red-yellow podzols' 'Regurs'] Merged dataset: 17 records with 9 crops
| Soil_Type | Province_Name | Rice_Yield_Ton_Ha | Cassava_Yield_Ton_Ha | Mango_Density | Corn_Yield_Ton_Ha | Soybean_Yield_Ton_Ha | Rubber_Yield_Ton_Ha | |
|---|---|---|---|---|---|---|---|---|
| 0 | Acid Lithosols | Mondolkiri | 2.541169 | 14.828160 | 22.713637 | 3.992694 | 2.320766 | 2.834822 |
| 1 | Alluvial Lithosols | Kratie | 3.916145 | 14.744428 | 31.758972 | 4.869894 | 1.998584 | 3.021495 |
| 2 | Alumisols | Ratanakiri | 3.363890 | 10.171943 | 28.455453 | 3.358783 | 2.766808 | 2.727259 |
| 3 | Basic Lithosols | Stung Treng | 3.723706 | 11.293900 | 24.473039 | 5.282355 | 3.107344 | 3.924895 |
| 4 | Brown Alluvial Soils | Kampong Cham | 3.963988 | 10.998369 | 78.175078 | 5.120572 | 1.550838 | 2.636007 |
| 5 | Brown hydromorphics | Kampong Cham | 3.963988 | 10.998369 | 78.175078 | 5.120572 | 1.550838 | 2.636007 |
| 6 | Coastal Complex | Preah Sihanouk | 2.863650 | 13.421165 | 69.724251 | 5.188819 | 2.079503 | 3.885819 |
| 7 | Cultural hydromorphics | Kandal | 2.616167 | 13.037724 | 55.873999 | 4.075397 | 2.128712 | 3.636030 |
| 8 | Great Lake | Kampong Thom | 2.811989 | 10.232252 | 73.689641 | 3.222134 | 2.772821 | 2.854216 |
| 9 | Grey hydromorphics | Prey Veng | 3.108484 | 10.610191 | 36.856071 | 5.661638 | 3.359395 | 3.037581 |
7: Exploratory Analysis - Production Metrics¶
# Statistical summary of crop production
crop_stats = gdf_merged[numeric_cols].describe().T
display(crop_stats)
# Correlation analysis
correlation_matrix = gdf_merged[numeric_cols].corr()
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(
correlation_matrix,
annot=True,
cmap="coolwarm",
center=0,
square=True,
ax=ax,
cbar_kws={"label": "Correlation"},
)
ax.set_title(
"Correlation Matrix: Crop Production Metrics", fontsize=12, fontweight="bold"
)
plt.tight_layout()
plt.show()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Rice_Yield_Ton_Ha | 17.0 | 3.010962 | 0.548113 | 2.541169 | 2.616167 | 2.616167 | 3.363890 | 3.963988 |
| Cassava_Yield_Ton_Ha | 17.0 | 12.447092 | 1.450653 | 10.171943 | 10.998369 | 13.037724 | 13.037724 | 14.828160 |
| Mango_Density | 17.0 | 52.412542 | 17.819997 | 22.713637 | 36.856071 | 55.873999 | 55.873999 | 78.175078 |
| Corn_Yield_Ton_Ha | 17.0 | 4.377685 | 0.694786 | 3.222134 | 4.075397 | 4.075397 | 5.120572 | 5.661638 |
| Soybean_Yield_Ton_Ha | 17.0 | 2.266858 | 0.480496 | 1.550838 | 2.128712 | 2.128712 | 2.320766 | 3.359395 |
| Rubber_Yield_Ton_Ha | 17.0 | 3.332137 | 0.460353 | 2.636007 | 2.854216 | 3.636030 | 3.636030 | 3.924895 |
| Sugarcane_Yield_Ton_Ha | 17.0 | 77.219401 | 11.664723 | 53.611591 | 69.578110 | 86.330635 | 86.330635 | 89.426018 |
| Pepper_Yield_Kg_Ha | 17.0 | 1934.939428 | 288.548371 | 1339.743663 | 1967.759186 | 1967.759186 | 2016.346543 | 2405.094983 |
| PalmOil_Yield_Ton_Ha | 17.0 | 16.014622 | 3.490961 | 10.050616 | 13.390298 | 18.870864 | 18.870864 | 18.971103 |
8: Distribution Analysis with Multiple Visualizations¶
# Distribution plots for each crop metric
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
crops_to_plot = [
("Rice_Yield_Ton_Ha", "Rice Yield (t/ha)", "skyblue"),
("Cassava_Yield_Ton_Ha", "Cassava Yield (t/ha)", "lightgreen"),
("Mango_Density", "Mango Density (trees/ha)", "orange"),
("Corn_Yield_Ton_Ha", "Corn Yield (t/ha)", "gold"),
("Soybean_Yield_Ton_Ha", "Soybean Yield (t/ha)", "lightcoral"),
("Rubber_Yield_Ton_Ha", "Rubber Yield (t/ha)", "plum"),
("Sugarcane_Yield_Ton_Ha", "Sugarcane Yield (t/ha)", "khaki"),
("Pepper_Yield_Kg_Ha", "Pepper Yield (kg/ha)", "salmon"),
("PalmOil_Yield_Ton_Ha", "Palm Oil Yield (t/ha)", "lightsteelblue"),
]
for idx, (col, title, color) in enumerate(crops_to_plot):
ax = axes[idx // 3, idx % 3]
ax.hist(gdf_merged[col], bins=15, color=color, edgecolor="black", alpha=0.7)
ax.set_title(title, fontweight="bold")
ax.set_xlabel("Yield/Density")
ax.set_ylabel("Frequency")
ax.axvline(gdf_merged[col].mean(), color="red", linestyle="--", label="Mean")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
9: Rule-Based Expert Classification¶
def classify_soil_crop_match(row):
"""
Expert-driven zone classification based on soil-crop suitability.
Rules:
- Prime Rice Zone: Fluvisol/Gleysol/Vertisol + Rice > 3.0 t/ha
- Prime Industrial Zone: Acrisol/Ferralsol/Nitosol + Cassava > 12.0 t/ha
- Fruit Zone: Suitable soils + Mango > 40 trees/ha
- Conservation: Lithosols or Water
- General Agriculture: All others
"""
soil_type = str(row["Soil_Type"]).lower()
rice_yield = row["Rice_Yield_Ton_Ha"]
cassava_yield = row["Cassava_Yield_Ton_Ha"]
mango_density = row["Mango_Density"]
if (
any(kw in soil_type for kw in ["fluvisol", "gleysol", "vertisol"])
and rice_yield > 3.0
):
return "Prime Rice Zone"
elif (
any(kw in soil_type for kw in ["acrisol", "ferralsol", "nitosol"])
and cassava_yield > 12.0
):
return "Prime Industrial Zone"
elif mango_density > 40.0 and not any(
kw in soil_type for kw in ["lithosol", "water"]
):
return "Fruit Zone"
elif any(kw in soil_type for kw in ["lithosol", "water"]):
return "Conservation"
else:
return "General Agriculture"
gdf_merged["Soil_Crop_Match"] = gdf_merged.apply(classify_soil_crop_match, axis=1)
# Display classification results
zone_summary = gdf_merged["Soil_Crop_Match"].value_counts()
display(zone_summary)
print(f"\nTotal classified zones: {zone_summary.sum()}")
Soil_Crop_Match Fruit Zone 12 Conservation 3 General Agriculture 2 Name: count, dtype: int64
Total classified zones: 17
10: Visualization - Rule-Based Zone Distribution¶
# Zone distribution visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
zone_counts = gdf_merged["Soil_Crop_Match"].value_counts()
colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]
zone_counts.plot(kind="bar", ax=ax1, color=colors[: len(zone_counts)])
ax1.set_title("Rule-Based Zone Distribution", fontsize=12, fontweight="bold")
ax1.set_ylabel("Number of Polygons")
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha="right")
zone_pct = gdf_merged["Soil_Crop_Match"].value_counts(normalize=True) * 100
ax2.pie(
zone_pct, labels=zone_pct.index, autopct="%1.1f%%", colors=colors[: len(zone_pct)]
)
ax2.set_title("Zone Distribution by Percentage", fontsize=12, fontweight="bold")
plt.tight_layout()
plt.show()
11: Machine Learning - K-Means Clustering¶
# Prepare data for clustering
X = gdf_merged[numeric_cols].copy()
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Evaluate optimal number of clusters using silhouette score
silhouette_scores = []
davies_bouldin_scores = []
cluster_range = range(2, 8)
for k in cluster_range:
kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans_temp.fit_predict(X_scaled)
silhouette_scores.append(silhouette_score(X_scaled, labels))
davies_bouldin_scores.append(davies_bouldin_score(X_scaled, labels))
# Plot evaluation metrics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
ax1.plot(
cluster_range,
silhouette_scores,
marker="o",
linewidth=2,
markersize=8,
color="green",
)
ax1.axvline(4, color="red", linestyle="--", alpha=0.7, label="Selected k=4")
ax1.set_xlabel("Number of Clusters")
ax1.set_ylabel("Silhouette Score")
ax1.set_title("Silhouette Score by Cluster Count", fontweight="bold")
ax1.legend()
ax1.grid(True, alpha=0.3)
ax2.plot(
cluster_range,
davies_bouldin_scores,
marker="s",
linewidth=2,
markersize=8,
color="purple",
)
ax2.axvline(4, color="red", linestyle="--", alpha=0.7, label="Selected k=4")
ax2.set_xlabel("Number of Clusters")
ax2.set_ylabel("Davies-Bouldin Score (lower is better)")
ax2.set_title("Davies-Bouldin Score by Cluster Count", fontweight="bold")
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Train final K-Means model with k=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
gdf_merged["Cluster_Label"] = kmeans.fit_predict(X_scaled)
print(f"K-Means model trained (k=4)")
print(
f"Silhouette Score: {silhouette_score(X_scaled, gdf_merged['Cluster_Label']):.4f}"
)
print(
f"Davies-Bouldin Score: {davies_bouldin_score(X_scaled, gdf_merged['Cluster_Label']):.4f}"
)
K-Means model trained (k=4) Silhouette Score: 0.5173 Davies-Bouldin Score: 1.0517
12: Automated Cluster Interpretation¶
# Type hints for Pylance: gdf_merged is GeoDataFrame from Cell 6, numeric_cols is list from Cell 6
# Analyze cluster characteristics using all 9 crops
cluster_profiles = gdf_merged.groupby("Cluster_Label")[numeric_cols].mean()
global_means = gdf_merged[numeric_cols].mean()
display(cluster_profiles.round(2))
# Dynamic cluster labeling - enhanced with all 9 crops
cluster_mapping = {}
for cluster_id in cluster_profiles.index:
row = cluster_profiles.loc[cluster_id]
ratios = {
"Rice Bowl Zone": (
row["Rice_Yield_Ton_Ha"] / global_means["Rice_Yield_Ton_Ha"]
if global_means["Rice_Yield_Ton_Ha"] > 0
else 0
),
"Industrial Crop Zone": (
row["Cassava_Yield_Ton_Ha"] / global_means["Cassava_Yield_Ton_Ha"]
if global_means["Cassava_Yield_Ton_Ha"] > 0
else 0
),
"Fruit Hub": (
row["Mango_Density"] / global_means["Mango_Density"]
if global_means["Mango_Density"] > 0
else 0
),
"Corn Production Zone": (
row["Corn_Yield_Ton_Ha"] / global_means["Corn_Yield_Ton_Ha"]
if global_means["Corn_Yield_Ton_Ha"] > 0
else 0
),
"Soybean Production Zone": (
row["Soybean_Yield_Ton_Ha"] / global_means["Soybean_Yield_Ton_Ha"]
if global_means["Soybean_Yield_Ton_Ha"] > 0
else 0
),
"Rubber Plantation Zone": (
row["Rubber_Yield_Ton_Ha"] / global_means["Rubber_Yield_Ton_Ha"]
if global_means["Rubber_Yield_Ton_Ha"] > 0
else 0
),
"Sugarcane Industry Zone": (
row["Sugarcane_Yield_Ton_Ha"] / global_means["Sugarcane_Yield_Ton_Ha"]
if global_means["Sugarcane_Yield_Ton_Ha"] > 0
else 0
),
"Spice Production Zone": (
row["Pepper_Yield_Kg_Ha"] / global_means["Pepper_Yield_Kg_Ha"]
if global_means["Pepper_Yield_Kg_Ha"] > 0
else 0
),
"Palm Oil Zone": (
row["PalmOil_Yield_Ton_Ha"] / global_means["PalmOil_Yield_Ton_Ha"]
if global_means["PalmOil_Yield_Ton_Ha"] > 0
else 0
),
}
max_feature = max(ratios, key=ratios.get)
cluster_mapping[cluster_id] = (
max_feature if ratios[max_feature] >= 1.0 else "Multi-Crop Zone"
)
gdf_merged["Cluster_Name"] = gdf_merged["Cluster_Label"].map(cluster_mapping)
print("Cluster Mapping (9-Crop Analysis):")
for cluster_id, label in sorted(cluster_mapping.items()):
cluster_row = cluster_profiles.loc[cluster_id]
ratios_display = {
"Rice Bowl Zone": (
cluster_row["Rice_Yield_Ton_Ha"] / global_means["Rice_Yield_Ton_Ha"]
if global_means["Rice_Yield_Ton_Ha"] > 0
else 0
),
"Industrial Crop Zone": (
cluster_row["Cassava_Yield_Ton_Ha"] / global_means["Cassava_Yield_Ton_Ha"]
if global_means["Cassava_Yield_Ton_Ha"] > 0
else 0
),
"Fruit Hub": (
cluster_row["Mango_Density"] / global_means["Mango_Density"]
if global_means["Mango_Density"] > 0
else 0
),
"Corn Production Zone": (
cluster_row["Corn_Yield_Ton_Ha"] / global_means["Corn_Yield_Ton_Ha"]
if global_means["Corn_Yield_Ton_Ha"] > 0
else 0
),
"Soybean Production Zone": (
cluster_row["Soybean_Yield_Ton_Ha"] / global_means["Soybean_Yield_Ton_Ha"]
if global_means["Soybean_Yield_Ton_Ha"] > 0
else 0
),
"Rubber Plantation Zone": (
cluster_row["Rubber_Yield_Ton_Ha"] / global_means["Rubber_Yield_Ton_Ha"]
if global_means["Rubber_Yield_Ton_Ha"] > 0
else 0
),
"Sugarcane Industry Zone": (
cluster_row["Sugarcane_Yield_Ton_Ha"]
/ global_means["Sugarcane_Yield_Ton_Ha"]
if global_means["Sugarcane_Yield_Ton_Ha"] > 0
else 0
),
"Spice Production Zone": (
cluster_row["Pepper_Yield_Kg_Ha"] / global_means["Pepper_Yield_Kg_Ha"]
if global_means["Pepper_Yield_Kg_Ha"] > 0
else 0
),
"Palm Oil Zone": (
cluster_row["PalmOil_Yield_Ton_Ha"] / global_means["PalmOil_Yield_Ton_Ha"]
if global_means["PalmOil_Yield_Ton_Ha"] > 0
else 0
),
}
top_crops = sorted(
[(k, v) for k, v in ratios_display.items()], key=lambda x: x[1], reverse=True
)[:3]
print(f" Cluster {cluster_id}: {label}")
print(
" Top 3 crops: "
+ ", ".join(
f"{crop.replace(' Zone', '')}: {ratio:.2f}x" for crop, ratio in top_crops
)
)
| Rice_Yield_Ton_Ha | Cassava_Yield_Ton_Ha | Mango_Density | Corn_Yield_Ton_Ha | Soybean_Yield_Ton_Ha | Rubber_Yield_Ton_Ha | Sugarcane_Yield_Ton_Ha | Pepper_Yield_Kg_Ha | PalmOil_Yield_Ton_Ha | |
|---|---|---|---|---|---|---|---|---|---|
| Cluster_Label | |||||||||
| 0 | 2.61 | 13.24 | 52.19 | 4.07 | 2.15 | 3.55 | 86.67 | 2016.35 | 18.06 |
| 1 | 3.09 | 10.20 | 51.07 | 3.29 | 2.77 | 2.79 | 68.24 | 1782.79 | 17.23 |
| 2 | 3.60 | 11.81 | 75.36 | 5.14 | 1.73 | 3.05 | 69.42 | 2139.91 | 12.60 |
| 3 | 3.58 | 12.22 | 31.03 | 5.27 | 2.82 | 3.33 | 62.63 | 1587.16 | 12.47 |
Cluster Mapping (9-Crop Analysis):
Cluster 0: Palm Oil Zone
Top 3 crops: Palm Oil: 1.13x, Sugarcane Industry: 1.12x, Rubber Plantation: 1.06x
Cluster 1: Soybean Production Zone
Top 3 crops: Soybean Production: 1.22x, Palm Oil: 1.08x, Rice Bowl: 1.03x
Cluster 2: Fruit Hub
Top 3 crops: Fruit Hub: 1.44x, Rice Bowl: 1.19x, Corn Production: 1.17x
Cluster 3: Soybean Production Zone
Top 3 crops: Soybean Production: 1.24x, Corn Production: 1.20x, Rice Bowl: 1.19x
13: Advanced Visualization - PCA Analysis¶
# Type hints for Pylance: gdf_merged (GeoDataFrame), X_scaled (ndarray), numeric_cols (list)
# Principal Component Analysis for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# PCA by ML clusters - comprehensive color scheme for all cluster types
cluster_colors = {
"Rice Bowl Zone": "#e74c3c",
"Industrial Crop Zone": "#3498db",
"Fruit Hub": "#2ecc71",
"Corn Production Zone": "#f39c12",
"Soybean Production Zone": "#9b59b6",
"Rubber Plantation Zone": "#1abc9c",
"Sugarcane Industry Zone": "#e67e22",
"Spice Production Zone": "#c0392b",
"Palm Oil Zone": "#34495e",
"Multi-Crop Zone": "#95a5a6",
}
for cluster in gdf_merged["Cluster_Name"].unique():
mask = gdf_merged["Cluster_Name"] == cluster
axes[0].scatter(
X_pca[mask, 0],
X_pca[mask, 1],
label=cluster,
s=100,
alpha=0.6,
color=cluster_colors.get(cluster, "#808080"),
)
axes[0].set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
axes[0].set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
axes[0].set_title("PCA: ML Clusters", fontweight="bold")
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# PCA by rule-based zones
zone_colors = {
"Prime Rice Zone": "#1f77b4",
"Prime Industrial Zone": "#ff7f0e",
"Fruit Zone": "#2ca02c",
"Conservation": "#d62728",
"General Agriculture": "#9467bd",
}
for zone in gdf_merged["Soil_Crop_Match"].unique():
mask = gdf_merged["Soil_Crop_Match"] == zone
axes[1].scatter(
X_pca[mask, 0],
X_pca[mask, 1],
label=zone,
s=100,
alpha=0.6,
color=zone_colors.get(zone, "#808080"),
)
axes[1].set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
axes[1].set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
axes[1].set_title("PCA: Rule-Based Zones", fontweight="bold")
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(
f"Explained Variance: PC1={pca.explained_variance_ratio_[0]:.1%}, PC2={pca.explained_variance_ratio_[1]:.1%}"
)
Explained Variance: PC1=35.9%, PC2=20.4%
14: Cross-Analysis - Expert vs. Machine Learning Agreement¶
# Type hints for Pylance: gdf_merged (GeoDataFrame), numeric_cols (list), pd (DataFrame)
# Compare rule-based and ML classifications
comparison_matrix = pd.crosstab(
gdf_merged["Soil_Crop_Match"], gdf_merged["Cluster_Name"]
)
display(comparison_matrix)
# Feature importance by zone - all 9 crops
fig, axes = plt.subplots(3, 3, figsize=(16, 12))
for idx, col in enumerate(numeric_cols):
ax = axes[idx // 3, idx % 3]
zone_means = (
gdf_merged.groupby("Soil_Crop_Match")[col].mean().sort_values(ascending=False)
)
zone_means.plot(kind="barh", ax=ax, color="steelblue")
ax.set_title(f"{col} by Rule-Based Zone", fontweight="bold")
ax.set_xlabel("Mean Value")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
| Cluster_Name | Fruit Hub | Palm Oil Zone | Soybean Production Zone |
|---|---|---|---|
| Soil_Crop_Match | |||
| Conservation | 0 | 1 | 2 |
| Fruit Zone | 3 | 8 | 1 |
| General Agriculture | 0 | 0 | 2 |
15: Interactive Geographic Visualization¶
# Type hints for Pylance: folium (module), gdf_merged (GeoDataFrame), base_path (Path)
# Create interactive Folium map with multiple enhancements
cambodia_center = [12.5, 104.9]
base_map = folium.Map(location=cambodia_center, zoom_start=7, tiles="OpenStreetMap")
# Define color schemes
rule_colors = {
"Prime Rice Zone": "#1f77b4",
"Prime Industrial Zone": "#ff7f0e",
"Fruit Zone": "#2ca02c",
"Conservation": "#d62728",
"General Agriculture": "#9467bd",
}
ml_colors = {
"Rice Bowl Zone": "#e74c3c",
"Industrial Crop Zone": "#3498db",
"Fruit Hub": "#2ecc71",
"Corn Production Zone": "#f39c12",
"Soybean Production Zone": "#9b59b6",
"Rubber Plantation Zone": "#1abc9c",
"Sugarcane Industry Zone": "#e67e22",
"Spice Production Zone": "#c0392b",
"Palm Oil Zone": "#34495e",
"Multi-Crop Zone": "#95a5a6",
}
# Calculate statistics for each zone
zone_stats = (
gdf_merged.groupby("Soil_Crop_Match")
.agg({"geometry": "count", "Rice_Yield_Ton_Ha": "mean"})
.rename(columns={"geometry": "polygon_count"})
)
zone_stats["area_km2"] = gdf_merged.groupby("Soil_Crop_Match")["geometry"].apply(
lambda geom: geom.to_crs("EPSG:3857").area.sum() / 1e6
)
# Get unique provinces and soil types
provinces_list = sorted(gdf_merged["Province_Name"].unique().tolist())
soil_types_list = sorted(gdf_merged["Soil_Type"].unique().tolist())
# Rule-based layer with enhanced popups and tooltips
rule_group = folium.FeatureGroup(name="Expert Rules", show=True)
for idx, row in gdf_merged.iterrows():
zone = row["Soil_Crop_Match"]
color = rule_colors.get(zone, "#808080")
popup_text = (
f"<b>{zone}</b><br>"
f"Soil: {row['Soil_Type']}<br>"
f"Province: {row['Province_Name']}<br>"
f"Rice Yield: {row['Rice_Yield_Ton_Ha']:.2f} t/ha"
)
tooltip_text = (
f"<b>{zone}</b><br>" f"{row['Soil_Type']}<br>" f"{row['Province_Name']}"
)
folium.GeoJson(
row["geometry"].__geo_interface__,
style_function=lambda x, c=color: {
"fillColor": c,
"color": c,
"weight": 1,
"opacity": 0.3,
"fillOpacity": 0.6,
},
popup=folium.Popup(popup_text, max_width=300),
tooltip=folium.Tooltip(tooltip_text, sticky=True),
).add_to(rule_group)
rule_group.add_to(base_map)
# ML cluster layer with enhanced popups and tooltips
ml_group = folium.FeatureGroup(name="ML Clusters", show=False)
for idx, row in gdf_merged.iterrows():
cluster = row["Cluster_Name"]
color = ml_colors.get(cluster, "#808080")
popup_text = (
f"<b>{cluster}</b><br>"
f"Soil: {row['Soil_Type']}<br>"
f"Province: {row['Province_Name']}<br>"
f"Cassava Yield: {row['Cassava_Yield_Ton_Ha']:.2f} t/ha"
)
tooltip_text = (
f"<b>{cluster}</b><br>" f"{row['Soil_Type']}<br>" f"{row['Province_Name']}"
)
folium.GeoJson(
row["geometry"].__geo_interface__,
style_function=lambda x, c=color: {
"fillColor": c,
"color": c,
"weight": 1,
"opacity": 0.3,
"fillOpacity": 0.6,
},
popup=folium.Popup(popup_text, max_width=300),
tooltip=folium.Tooltip(tooltip_text, sticky=True),
).add_to(ml_group)
ml_group.add_to(base_map)
# Add layer control
folium.LayerControl(collapsed=False).add_to(base_map)
# Add measurement tool (draw control)
from folium.plugins import MeasureControl
MeasureControl(
primary_length_unit="kilometers",
secondary_length_unit="miles",
primary_area_unit="sqkilometers",
secondary_area_unit="sqmiles",
).add_to(base_map)
# Add legend for rule-based zones
legend_html_rules = """
<div style="position: fixed;
bottom: 50px; left: 50px; width: 220px; height: auto;
background-color: white; border:2px solid grey; z-index:9999;
font-size:12px; padding: 10px; border-radius: 5px;">
<p style="margin: 0; font-weight: bold; text-decoration: underline;">Expert Rules Legend</p>
"""
for zone, color in rule_colors.items():
legend_html_rules += f'<p style="margin: 5px 0;"><i style="background:{color}; width: 18px; height: 18px; float: left; margin-right: 8px; border: 1px solid black;"></i>{zone}</p>'
legend_html_rules += "</div>"
# Add legend for ML clusters
legend_html_ml = """
<div style="position: fixed;
bottom: 50px; right: 50px; width: 220px; height: auto;
background-color: white; border:2px solid grey; z-index:9999;
font-size:12px; padding: 10px; border-radius: 5px; display: none;" id="ml-legend">
<p style="margin: 0; font-weight: bold; text-decoration: underline;">ML Clusters Legend</p>
"""
for cluster, color in ml_colors.items():
legend_html_ml += f'<p style="margin: 5px 0;"><i style="background:{color}; width: 18px; height: 18px; float: left; margin-right: 8px; border: 1px solid black;"></i>{cluster}</p>'
legend_html_ml += "</div>"
# Add statistics panel
stats_html = """
<div style="position: fixed;
bottom: 50px; left: 300px; width: 280px; height: auto;
background-color: white; border:2px solid grey; z-index:9999;
font-size:11px; padding: 10px; border-radius: 5px; max-height: 300px; overflow-y: auto;">
<p style="margin: 0 0 8px 0; font-weight: bold; text-decoration: underline;">Zone Statistics</p>
"""
for zone, row in zone_stats.iterrows():
stats_html += f"""
<div style="margin: 5px 0; padding: 5px; border-bottom: 1px solid #eee;">
<b style="color: {rule_colors.get(zone, '#808080')}">{zone}</b><br>
Polygons: {int(row['polygon_count'])}<br>
Area: {row['area_km2']:.1f} km²
</div>
"""
stats_html += "</div>"
# Add search/filter panel
search_html = f"""
<div style="position: fixed;
top: 120px; right: 50px; width: 280px;
background-color: white; border:2px solid grey; z-index:9999;
font-size:11px; padding: 10px; border-radius: 5px; max-height: 300px; overflow-y: auto;">
<p style="margin: 0 0 8px 0; font-weight: bold; text-decoration: underline;">Search & Filter</p>
<div style="margin-bottom: 10px;">
<label style="font-weight: bold; display: block; margin-bottom: 3px;">Province:</label>
<input type="text" id="province-search" placeholder="Search provinces..."
style="width: 100%; padding: 5px; font-size: 11px;">
<div id="province-list" style="max-height: 100px; overflow-y: auto; margin-top: 5px;">
{"".join([f'<label style="display: block; margin: 3px 0;"><input type="checkbox" class="prov-filter" value="{p}"> {p}</label>' for p in provinces_list[:10]])}
{'<span style="color: #666; font-size: 10px;">... and ' + str(len(provinces_list) - 10) + ' more</span>' if len(provinces_list) > 10 else ''}
</div>
</div>
<div>
<label style="font-weight: bold; display: block; margin-bottom: 3px;">Soil Type:</label>
<input type="text" id="soil-search" placeholder="Search soil types..."
style="width: 100%; padding: 5px; font-size: 11px;">
<div id="soil-list" style="max-height: 100px; overflow-y: auto; margin-top: 5px;">
{"".join([f'<label style="display: block; margin: 3px 0;"><input type="checkbox" class="soil-filter" value="{s}"> {s}</label>' for s in soil_types_list[:8]])}
{'<span style="color: #666; font-size: 10px;">... and ' + str(len(soil_types_list) - 8) + ' more</span>' if len(soil_types_list) > 8 else ''}
</div>
</div>
</div>
"""
base_map.get_root().html.add_child(folium.Element(legend_html_rules))
base_map.get_root().html.add_child(folium.Element(legend_html_ml))
base_map.get_root().html.add_child(folium.Element(stats_html))
base_map.get_root().html.add_child(folium.Element(search_html))
# Add title
title_html = """
<div style="position: fixed;
top: 10px; left: 50px; width: 400px;
background-color: white; border:2px solid grey; z-index:9999;
font-size:14px; font-weight: bold; padding: 10px; border-radius: 5px;">
Cambodia Agricultural Zoning Map<br>
<span style="font-size: 11px; font-weight: normal;">
• Hover over polygons for quick info
• Click for details
• Use ruler icon (top-right) to measure distances/areas
• Filter by province or soil type on the right
</span>
</div>
"""
base_map.get_root().html.add_child(folium.Element(title_html))
# Fit map to bounds of all polygons
if len(gdf_merged) > 0:
bounds = gdf_merged.geometry.total_bounds
base_map.fit_bounds([[bounds[1], bounds[0]], [bounds[3], bounds[2]]])
# Save map
output_map = base_path / "Cambodia_Agricultural_Zoning_Map.html"
base_map.save(str(output_map))
print(f"Interactive map saved to: {output_map}")
print(f"Map bounds: {bounds}")
print(f"Provinces available: {len(provinces_list)}")
print(f"Soil types available: {len(soil_types_list)}")
print(f"\nZone Statistics:")
print(zone_stats.to_string())
display(base_map)
Interactive map saved to: D:\National Soil Information and Land Suitability Evaluation System in Cambodia\Cambodia_Agricultural_Zoning_Map.html
Map bounds: [102.33962124 10.40077472 107.63151382 14.63656554]
Provinces available: 9
Soil types available: 17
Zone Statistics:
polygon_count Rice_Yield_Ton_Ha area_km2
Soil_Crop_Match
Conservation 3 3.393673 65870.802327
Fruit Zone 12 2.877746 104238.446849
General Agriculture 2 3.236187 21374.704054
16: Results Export & Documentation¶
# Type hints for Pylance: base_path (Path), gdf_merged (GeoDataFrame), numeric_cols (list)
# Export results to multiple formats
export_cols = [
"Soil_Type",
"Province_Name",
"Rice_Yield_Ton_Ha",
"Cassava_Yield_Ton_Ha",
"Mango_Density",
"Soil_Crop_Match",
"Cluster_Label",
"Cluster_Name",
]
# GeoJSON export
geojson_path = base_path / "Cambodia_Agricultural_Zones_Results.geojson"
gdf_merged.to_file(str(geojson_path), driver="GeoJSON")
print(f"GeoJSON exported: {geojson_path}")
# CSV export
csv_path = base_path / "Cambodia_Agricultural_Zones_Analysis.csv"
gdf_merged[export_cols].to_csv(str(csv_path), index=False)
print(f"CSV exported: {csv_path}")
# Summary statistics
summary = gdf_merged.groupby("Soil_Crop_Match")[numeric_cols].agg(
["mean", "min", "max"]
)
display(summary.round(2))
GeoJSON exported: D:\National Soil Information and Land Suitability Evaluation System in Cambodia\Cambodia_Agricultural_Zones_Results.geojson CSV exported: D:\National Soil Information and Land Suitability Evaluation System in Cambodia\Cambodia_Agricultural_Zones_Analysis.csv
| Rice_Yield_Ton_Ha | Cassava_Yield_Ton_Ha | Mango_Density | Corn_Yield_Ton_Ha | Soybean_Yield_Ton_Ha | Rubber_Yield_Ton_Ha | Sugarcane_Yield_Ton_Ha | Pepper_Yield_Kg_Ha | PalmOil_Yield_Ton_Ha | |||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | min | max | mean | min | max | mean | min | max | mean | min | max | mean | min | max | mean | min | max | mean | min | max | mean | min | max | mean | min | max | |
| Soil_Crop_Match | |||||||||||||||||||||||||||
| Conservation | 3.39 | 2.54 | 3.92 | 13.62 | 11.29 | 14.83 | 26.32 | 22.71 | 31.76 | 4.71 | 3.99 | 5.28 | 2.48 | 2.00 | 3.11 | 3.26 | 2.83 | 3.92 | 70.87 | 53.61 | 89.43 | 1926.56 | 1580.10 | 2405.09 | 12.99 | 10.84 | 16.52 |
| Fruit Zone | 2.88 | 2.62 | 3.96 | 12.50 | 10.23 | 13.42 | 62.23 | 55.87 | 78.18 | 4.27 | 3.22 | 5.19 | 2.08 | 1.55 | 2.77 | 3.43 | 2.64 | 3.89 | 80.01 | 59.51 | 86.33 | 1958.46 | 1339.74 | 2387.04 | 17.31 | 11.01 | 18.97 |
| General Agriculture | 3.24 | 3.11 | 3.36 | 10.39 | 10.17 | 10.61 | 32.66 | 28.46 | 36.86 | 4.51 | 3.36 | 5.66 | 3.06 | 2.77 | 3.36 | 2.88 | 2.73 | 3.04 | 70.03 | 64.71 | 75.34 | 1806.37 | 1386.91 | 2225.83 | 12.77 | 10.05 | 15.49 |
17: Final Report & Key Insights¶
# Type hints for Pylance: gdf_merged (GeoDataFrame), pd (DataFrame), numeric_cols (list)
# Generate comprehensive summary
report_data = []
for zone in sorted(gdf_merged["Soil_Crop_Match"].unique()):
zone_data = gdf_merged[gdf_merged["Soil_Crop_Match"] == zone]
report_data.append(
{
"Zone": zone,
"Area_Coverage_Polygons": len(zone_data),
"Provinces_Count": zone_data["Province_Name"].nunique(),
"Avg_Rice_Yield_t_ha": zone_data["Rice_Yield_Ton_Ha"].mean(),
"Avg_Cassava_Yield_t_ha": zone_data["Cassava_Yield_Ton_Ha"].mean(),
"Avg_Mango_Density": zone_data["Mango_Density"].mean(),
"Avg_Corn_Yield_t_ha": zone_data["Corn_Yield_Ton_Ha"].mean(),
"Avg_Soybean_Yield_t_ha": zone_data["Soybean_Yield_Ton_Ha"].mean(),
"Avg_Rubber_Yield_t_ha": zone_data["Rubber_Yield_Ton_Ha"].mean(),
"Avg_Sugarcane_Yield_t_ha": zone_data["Sugarcane_Yield_Ton_Ha"].mean(),
"Avg_Pepper_Yield_kg_ha": zone_data["Pepper_Yield_Kg_Ha"].mean(),
"Avg_PalmOil_Yield_t_ha": zone_data["PalmOil_Yield_Ton_Ha"].mean(),
}
)
report_df = pd.DataFrame(report_data).set_index("Zone")
display(report_df.round(2))
# Overall statistics
print("\nProject Statistics:")
print(f"Total Soil Polygons: {len(gdf_merged)}")
print(f"Provinces Analyzed: {gdf_merged['Province_Name'].nunique()}")
print(f"Soil Types Identified: {gdf_merged['Soil_Type'].nunique()}")
print(f"Rule-Based Zones: {gdf_merged['Soil_Crop_Match'].nunique()}")
print(f"ML Clusters: {gdf_merged['Cluster_Name'].nunique()}")
print(f"\nCoordinate Reference System: {gdf_merged.crs}")
| Area_Coverage_Polygons | Provinces_Count | Avg_Rice_Yield_t_ha | Avg_Cassava_Yield_t_ha | Avg_Mango_Density | Avg_Corn_Yield_t_ha | Avg_Soybean_Yield_t_ha | Avg_Rubber_Yield_t_ha | Avg_Sugarcane_Yield_t_ha | Avg_Pepper_Yield_kg_ha | Avg_PalmOil_Yield_t_ha | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Zone | |||||||||||
| Conservation | 3 | 3 | 3.39 | 13.62 | 26.32 | 4.71 | 2.48 | 3.26 | 70.87 | 1926.56 | 12.99 |
| Fruit Zone | 12 | 4 | 2.88 | 12.50 | 62.23 | 4.27 | 2.08 | 3.43 | 80.01 | 1958.46 | 17.31 |
| General Agriculture | 2 | 2 | 3.24 | 10.39 | 32.66 | 4.51 | 3.06 | 2.88 | 70.03 | 1806.37 | 12.77 |
Project Statistics: Total Soil Polygons: 17 Provinces Analyzed: 9 Soil Types Identified: 17 Rule-Based Zones: 3 ML Clusters: 3 Coordinate Reference System: EPSG:4326
Geospatial Crop Targeting System for Cambodia¶
Rule-Based Zoning & Machine Learning Clustering Analysis¶
Project: National Soil Information and Land Suitability Evaluation System
Objective: Merge geospatial soil data with crop production statistics to identify optimal agricultural zones using both expert-driven rules and unsupervised machine learning.
📋 Table of Contents¶
- Problem Definition - Objectives & Success Metrics
- Setup & Imports - Environment Configuration
- Data Acquisition - Load Geospatial & Statistical Data
- Exploratory Data Analysis (EDA) - Statistical Summaries & Distributions
- Data Cleaning & Validation - Handle Missing Values & Inconsistencies
- Data Preprocessing - CRS Alignment & Standardization
- Data Merge - Integrate Soil & Production Datasets
- Feature Engineering - Rule-Based Expert Classification
- Exploratory Visualization - Distributions & Correlations
- Machine Learning Clustering - K-Means Training & Analysis
- Cluster Interpretation - Automated Labeling & Characterization
- Advanced Visualization - Heatmaps, Correlations, PCA
- Interactive Mapping - Folium with Dual-Layer Control
- Model Evaluation & Comparison - Expert vs. ML Agreement Analysis
- Insights & Recommendations - Actionable Findings
- Results Export - Multi-Format Output Generation
Executive Summary¶
This notebook demonstrates a comprehensive data science workflow:
- Rule-Based Expert System: Applies domain knowledge to classify zones (Rice, Industrial Crops, Fruit, Conservation)
- Machine Learning (K-Means): Discovers underlying patterns in crop production data automatically
- Interactive Visualization: Enables stakeholder exploration and decision-making
The results are visualized on an interactive map with layer control for comparative analysis.
1️⃣ PROBLEM DEFINITION & BUSINESS CONTEXT¶
Objective¶
Develop a data-driven system to classify Cambodia's agricultural zones based on soil characteristics and crop production data, enabling targeted resource allocation and policy decisions.
Success Metrics¶
- ✅ Successfully merge geospatial (17 soil polygons) with statistical (24 provinces) datasets
- ✅ Create interpretable zone classifications for stakeholder decision-making
- ✅ Achieve >80% agreement between rule-based and ML approaches (indicates robust patterns)
- ✅ Generate interactive visualizations for policy makers
- ✅ Document recommendations for each zone type
Key Constraints¶
- Geospatial: Shapefile in UTM (EPSG:32648) requires reprojection for web mapping
- Data: Limited production data; synthetic generation needed for demonstration
- Scope: Focus on rice, cassava, and mango as primary crops
- Stakeholders: Agricultural Ministry, provincial governors, donor organizations
Expected Outcomes¶
- Interactive web map with dual classification layers
- Geographic prioritization of development investments
- Alignment of policy with agronomic suitability